Transfer learning strategies for solar power forecasting under data scarcity

Accurately forecasting solar plant production is critical for balancing supply and demand and for scheduling distribution network operation in the context of inclusive smart cities and energy communities. However, the problem becomes more demanding when there is an insufficient amount of data to adequately train forecasting models, either because plants have been recently installed or because smart meters are lacking. Transfer learning (TL) offers the capability of transferring knowledge from a source domain to different target domains to resolve related problems. This study uses the stacked Long Short-Term Memory (LSTM) model with three TL strategies to provide accurate solar plant production forecasts. TL is exploited both for weight initialization of the LSTM model and for feature extraction, using different freezing approaches. The presented TL strategies are compared to the conventional non-TL model, as well as to the smart persistence model, at forecasting the hourly production of 6 solar plants. Results indicate that TL models significantly outperform the conventional one, achieving a 12.6% accuracy improvement in terms of RMSE and 16.3% in terms of forecast skill index with 1 year of training data. The gap between the two approaches becomes even bigger when fewer training data are available (especially in the case of a 3-month training set), breaking new ground in power production forecasting of newly installed solar plants and rendering TL a reliable tool in the hands of self-producers towards the ultimate goal of energy balancing and demand response management from an early stage.

www.nature.com/scientificreports/

... third section (Results), describing the baseline model performance, the validation process of the proposed TL strategies, and the data availability impact. Finally, the fourth section (Discussion) summarizes the conclusion.

Methods
Experimental data and processing. Hourly PV production and weather data (temperature, humidity and solar irradiance) from 7 PV plants are exploited. PV production data are collected directly from the solar plant systems of a Portuguese energy community, while weather data are extracted from a local meteorological station 34 and the Copernicus Atmosphere Data Store 35. One PV plant, namely PV 1, is used to train the base model and the other PV plants are used for the development of the TL models. Specific information on the examined PV plants is presented in Table 1.
The PV plants are located in 4 cities in Portugal (4 PVs are located in Lisbon, and one each in Setubal, Faro and Braga) and the available data vary from 14 to 30 months depending on the PV plant. The selected PV plants also differ in terms of nominal and peak capacities. The base PV has a nominal capacity of 23.52 kW, while the target domain PVs' capacity varies from 30 kW to 271.53 kW, which is over 10 times the capacity of the base PV. The rationale behind the selection of the specific PV plants is to assess model performance on PV plants that are located both in the same city as the base PV (PV 2, PV 4 and PV 7) and in different cities (PV 3, PV 5 and PV 6), in order to report potential differences in the TL models' forecasting accuracy. The locations of the inspected PV plants are depicted in Fig. 1.
The selected features of the stacked LSTM model are temperature, humidity, solar irradiance, PV production, a one-hot encoding of the month of the year and a sine/cosine transformation of the hour of day. The following processing routine is conducted for each dataset. Firstly, data are normalized to the [0, 1] range. Secondly, data are transformed to the "5 inputs - 1 output" format to be processed by the stacked LSTM model. Thirdly, the datasets are split into train sets and test sets. The base model is trained on the whole dataset of the source domain. The TL models are trained on 12 months of data (the training set consists of 8760 hourly rows) and the remaining data are used for testing. The testing period therefore differs for each target PV depending on the total amount of available data; PV 7 has the longest testing period, consisting of 13172 hourly rows, while PV 3 has the shortest, consisting of 910 hourly rows. Finally, the same processing routine is implemented for ...

Transfer learning. TL is a technique that focuses on exploiting knowledge gained while solving one problem in order to solve a different problem with similar characteristics. The general concept of TL is transferring the expertise of a model from the source domain to the target domain, relaxing the hypothesis that the data of these two problems must be independent and identically distributed 36. TL provides numerous advantages, namely reduced training time 37, improved NN performance and, more importantly, the opportunity to achieve high accuracy with a limited amount of data 38. The general framework of the TL process is presented in Fig. 2.
According to the formal definition of TL proposed by Pan and Yang 39: "Given a source domain $D_S$ and learning task $T_S$, a target domain $D_T$ and learning task $T_T$, where $D_S \neq D_T$, or $T_S \neq T_T$, TL aims to help improve the learning of the target predictive function $f_T(\cdot)$ in $D_T$ using the knowledge in $D_S$ and $T_S$". This definition is better understood by defining the concepts of domain and task. A domain $D$ consists of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, ..., x_n\} \in \mathcal{X}$. The feature space can be defined as a collection of features related to specific properties of the data which are given as input to the model. Given a specific domain, $D = \{\mathcal{X}, P(X)\}$, a task consists of two components: a label space $\mathcal{Y}$ and an objective predictive function $f: \mathcal{X} \to \mathcal{Y}$. The objective predictive function aims to predict the label $f(x)$ of each new instance $x$.
Finally, according to Lu et al. 40, there are three main categories of TL methods. Inductive TL assumes that the learning task in the target domain is different from the learning task in the source domain. Unsupervised TL also assumes that the learning task in the target domain is different from the learning task in the source domain, but focuses only on unsupervised problems such as clustering and density estimation. Transductive TL assumes that the learning tasks are the same in both domains, while the source and target domains are different but related. The proposed TL approach for the problem of PV production forecasting belongs to the field of transductive TL, because the source and target tasks are the same (hourly PV prediction), while the source and target domains are different in terms of location, nominal capacity and weather conditions.

The long short-term memory model. One of the most suitable models for the application of TL in the PV production forecasting problem is the LSTM model 41. This is mainly because the LSTM is trained by updating the weights between the neurons of the deep learning model, which allows the creation of pre-trained models. Thus, the model can be pre-trained on the baseline PV, and the saved weights of the pre-trained model can then be used to apply TL on the target PV. The same applies to other NNs, but LSTM networks have shown the best performance, and the interest in PV power prediction using variations of LSTM networks has been continuously increasing over the past few years 42.
The LSTM is an RNN architecture with the innate ability to capture long-term dependencies in sequence prediction problems 43. The LSTM was developed to address the vanishing gradient problem, which can be described as the exponential increase (or decrease) of the backpropagated error signal as a function of the distance from the final layer, resulting in models which are unstable and incapable of efficient learning 44. The LSTM uses an additive gradient structure which incorporates direct access to a forget gate, enabling the network to stimulate desired behaviour from the error gradient 45.
The selection of LSTM over traditional ML algorithms and feedforward NNs is based on its suitability for holding long-term memory, which is essential when facing problems with sequential data with temporal relationships. LSTM is able to represent the dynamic performance of systems, thus being one of the most widely used models for dealing with time series problems, such as PV production forecasting 46. LSTMs provide a significant advantage over other methods, as they are able to detect linear relationships between nonlinear data. Such relationships may appear in the PV production forecasting problem, between power output and meteorological data. In this respect, the LSTM can benefit from features and detect relationships and patterns that other models would not be able to find. Furthermore, LSTM has been exploited for several energy-related time series problems, including residential energy consumption predictions 47,48 and natural gas demand forecasting 49.

The architecture of the LSTM cell is illustrated in Fig. 3. The presentation of the LSTM architecture follows the works of Graves 50 and Olah 51. The standard LSTM cell includes four NN layers, differing from common RNN architectures which include a single layer. Each line in Fig. 3 represents a vector on which several pointwise operations and NN layers are applied. The LSTM cell receives three inputs and produces two outputs. The inputs, passed in vector form, are the following: the current input $x_t$, the previous hidden state $h_{t-1}$ and the previous cell state $c_{t-1}$. The outputs of the LSTM are the cell state and the hidden state. On the one hand, the cell state (depicted by the horizontal line at the top) encapsulates the long-term memory capability of processing information of more distant events. On the other hand, the hidden state transfers information from immediately previous events and is overwritten at every step.
The core functionalities of the LSTM cell are implemented through its three gates: the forget gate, the input gate and the output gate.
1. The forget gate is the first block represented in the LSTM architecture. It determines which part of the information must be retained or discarded. The inputs of this gate are the previous hidden state $h_{t-1}$ and the current input $x_t$. These inputs are passed through the sigmoid function $\sigma_g$, which produces output values between 0, which denotes that no information passes through, and 1, which denotes that all information passes through. The result is the forget gate's activation vector $f_t$.
2. The input gate controls how the cell state is updated. Its functionality is performed in two parts. Firstly, the previous hidden state and the current input are passed into the second sigmoid function $\sigma_g$, producing the input gate's activation vector $i_t$. Secondly, the same inputs are passed into the hyperbolic tangent function $\sigma_c$ in order to regulate the network, producing the cell input activation vector $\tilde{c}_t$. Finally, the new cell state vector $c_t$ is obtained by adding the element-wise product of $\tilde{c}_t$ and $i_t$ to the previous cell state $c_{t-1}$ scaled element-wise by the forget gate's activation vector $f_t$.
3. Finally, the output gate determines the next hidden state $h_t$. The hidden state includes information on previous inputs and is used for prediction. The previous hidden state $h_{t-1}$ and the current input $x_t$ are passed into the third sigmoid function $\sigma_g$. Then, the updated cell state is passed to the hyperbolic tangent function $\sigma_h$. These two outputs are multiplied element-wise, allowing the network to determine which information the hidden state should carry.
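The three gates described above can be summarized in the standard LSTM formulation. The equations below are a reconstruction under the assumption that the usual convention applies to the symbols used in this section ($\sigma_g$, $\sigma_c$, $\sigma_h$, $W$, $U$, $b$); the gate subscripts $f$, $i$, $o$, $c$ on the parameters and the symbol $\odot$ for the element-wise product are notational assumptions.

```latex
\begin{align}
f_t &= \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma_g(W_i x_t + U_i h_{t-1} + b_i) \\
\tilde{c}_t &= \sigma_c(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \\
h_t &= o_t \odot \sigma_h(c_t)
\end{align}
```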
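The parameters $W \in \mathbb{R}^{h \times d}$, $U \in \mathbb{R}^{h \times h}$ and $b \in \mathbb{R}^{h}$ represent the weight matrices and bias vectors, respectively, which are learned during the training process, where $d$ is the number of input features and $h$ is the number of hidden units.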
... problems is attributed to their complex architecture. Deep NNs incorporate the concept of hierarchy due to the connection of multiple layers of neurons. Each layer is responsible for solving a small task of the main problem and its output is transferred to the next layer 52. The solution to the problem is produced by the last layer of the network. The intermediate layers of deep NNs are called hidden layers. The main idea of introducing hidden layers to the architecture is that each hidden layer generates more advanced representations of the problem, leading to higher abstraction levels. Thus, deep NNs can represent any non-linear function with relatively fewer neurons than a single-layer network. DL assumes that a hierarchical model with many layers is exponentially more efficient at approximating some functions than a shallower one 53.

This approach can also be applied to LSTMs. The original LSTM model is composed of a single layer which receives the input data and passes the output signal to a single feed-forward output layer. However, in this study an alternative architecture is proposed, which involves multiple hidden layers of multiple LSTM units followed by a feed-forward output layer. Each layer provides a sequence output to the next layer, rather than a single value output. This architecture is called a stacked LSTM network, and it was introduced by Graves et al. 54 in their application of LSTMs to speech recognition. As with simple feed-forward networks, stacking LSTM layers results in deeper models with higher levels of approximation accuracy. Moreover, because LSTMs are used with sequence data (their hidden state is a function of all previous hidden states), deeper architectures also lead to a deeper level of abstraction of the input data over time, providing a representation of the task at different timescales 55,56.
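The data flow of a stacked LSTM, in which every layer hands its full sequence of hidden states to the next layer and a feed-forward output neuron reads the last hidden state, can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the study: the helper names, the random initialization and the layer sizes are assumptions made for illustration, and a deep learning framework would normally replace this hand-written forward pass.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Return W @ x + U @ h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W[r], x))
        + sum(u * hi for u, hi in zip(U[r], h))
        + b[r]
        for r in range(len(b))
    ]


def init_layer(input_dim, hidden_dim, rng):
    """Random (W, U, b) parameters for the f, i, o and c blocks of one layer."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        gate: (mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), [0.0] * hidden_dim)
        for gate in "fioc"
    }


def lstm_step(params, x, h_prev, c_prev):
    """One LSTM cell update; returns the new hidden state and cell state."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]        # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]        # input gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]        # output gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def lstm_layer(params, sequence, hidden_dim):
    """Run one layer over a sequence and return the hidden state at every step."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    outputs = []
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs


def stacked_lstm_forecast(layers, dense_w, dense_b, sequence):
    """Feed the sequence output of each layer to the next one, then apply a
    feed-forward output neuron to the last hidden state of the top layer."""
    for params, hidden_dim in layers:
        sequence = lstm_layer(params, sequence, hidden_dim)
    last_hidden = sequence[-1]
    return sum(w * hk for w, hk in zip(dense_w, last_hidden)) + dense_b


# Illustrative use: a window of 5 hourly steps with 4 features, two stacked layers.
rng = random.Random(0)
layers = [(init_layer(4, 8, rng), 8), (init_layer(8, 6, rng), 6)]
dense_w = [rng.uniform(-0.1, 0.1) for _ in range(6)]
window = [[rng.random() for _ in range(4)] for _ in range(5)]
y_hat = stacked_lstm_forecast(layers, dense_w, 0.0, window)
```

The hidden sizes (8 and 6) and the four-feature window are placeholders; the text does not state the layer configuration used by the authors.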
TL is exploited through the process of reusing the weights of a model which has been trained on the source domain data to fine-tune a new model based on the target domain data. The pre-trained model is referred to as the base model, while each new model in the target domain is referred to as TL model. The weights of each layer of the base model can be processed differently in order to provide better performance of the TL model in the target domain, using the following approaches: (a) keep the weights of the layer fixed, (b) fine-tune the weights of the layer based on the target domain data and (c) train the weights of the layer from scratch based on the target domain data.
In this paper, three TL strategies are developed and compared in terms of forecasting accuracy for the problem of PV production forecasting.
• TL Strategy 1: In the first strategy, the weights of the initial layers are frozen and the only trainable weights are those of the last hidden layer. This strategy is known as weight freezing and is widely used to extract features from the source domain and carry them to the target domain. It is a common scheme in image processing, where the first layers are used as feature extraction layers and the last layers are used to adapt to new data.
• TL Strategy 2: In the second strategy, the base model is used as a weight initialization scheme for the TL model. The weights of all layers of the TL model are initialized based on data from the source domain and they are fine-tuned based on data from the target domain. This approach is extensively used with problems where there is an abundance of data in the source domain, but a scarcity of data in the target domain. However, a high degree of similarity between the source and the target domain is a necessary condition.
• TL Strategy 3: In the third strategy, the initial layers of the TL model are frozen and the last layer is trained from scratch: the last layer of the base model is popped and a new layer is added after the frozen layers. This approach is similar to the first one, but it differs in that the weights of the last layer are not initialized based on data from the source domain. Thus, the TL model serves as a feature extraction mechanism because of the frozen layers, but it can also be fine-tuned to the target domain because of the random initialization of the last layer's weights.
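The difference between the three strategies comes down to which layers keep their pre-trained weights and which layers remain trainable. The framework-agnostic sketch below makes this explicit. The `Layer` class, the helper names and the three-layer base model are hypothetical and only stand in for the hidden layers of a pre-trained network; how the feed-forward output layer is treated is not specified in the text and is left out.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for one hidden layer of the pre-trained base model."""
    name: str
    weights: list
    trainable: bool = True


def random_weights(n, rng):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


def strategy_1(base_model):
    """Weight freezing: initial layers frozen, last hidden layer keeps its
    pre-trained weights and is the only trainable one."""
    model = copy.deepcopy(base_model)
    for layer in model[:-1]:
        layer.trainable = False
    model[-1].trainable = True
    return model


def strategy_2(base_model):
    """Weight initialization: every layer starts from the pre-trained weights
    and is fine-tuned on the target domain."""
    model = copy.deepcopy(base_model)
    for layer in model:
        layer.trainable = True
    return model


def strategy_3(base_model, rng):
    """Initial layers frozen; the last layer is popped and replaced by a new,
    randomly initialized layer that is trained from scratch."""
    model = copy.deepcopy(base_model)
    removed = model.pop()
    for layer in model:
        layer.trainable = False
    new_weights = random_weights(len(removed.weights), rng)
    model.append(Layer(removed.name + "_new", new_weights, True))
    return model


# Illustrative base model with three hidden layers (sizes are placeholders).
rng = random.Random(0)
base = [Layer(f"lstm_{k}", random_weights(4, rng)) for k in (1, 2, 3)]
tl1, tl2, tl3 = strategy_1(base), strategy_2(base), strategy_3(base, random.Random(1))
```

Each function returns a copy, so the base model stays untouched and can be reused for every target PV plant.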

Results
The results of this study are presented in three parts: the forecasting performance of the baseline stacked LSTM model, the performance of the TL models compared to the conventional model in the target domain, and the results of applying TL with different volumes of available data. By the term conventional model we refer to the LSTM model to which no TL has been applied; in this context, the conventional LSTM model is trained solely on data from the target PV.
Baseline model performance. The stacked LSTM model has the following lag features: (a) measured power output, (b) air temperature, (c) global horizontal irradiance, (d) humidity, (e) month of the year (in the form of one-hot encoding) and (f) hour of the day (in the form of sine/cosine transformation). The abovementioned features are fed into the LSTM model in the "5 inputs - 1 output" format of hourly data. More specifically, a point value for each feature is fed into the model for the last five hours and the PV power output for the next hour is predicted (one-hour ahead power output forecast).
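The feature preparation described above (min-max normalization, one-hot month, sine/cosine hour and the "5 inputs - 1 output" windowing) can be sketched as follows. This is a minimal illustration with hypothetical function names, not the authors' pipeline; in particular, whether the sine/cosine values are rescaled to [0, 1] afterwards is not stated in the text.

```python
import math


def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def encode_hour(hour):
    """Sine/cosine transformation of the hour of day (0-23)."""
    angle = 2.0 * math.pi * hour / 24.0
    return math.sin(angle), math.cos(angle)


def encode_month(month):
    """One-hot encoding of the month of the year (1-12)."""
    return [1.0 if m == month else 0.0 for m in range(1, 13)]


def make_windows(rows, targets, n_lags=5):
    """Build "5 inputs - 1 output" samples: the feature rows of the last
    n_lags hours are paired with the target of the following hour."""
    samples = []
    for t in range(n_lags, len(rows)):
        samples.append((rows[t - n_lags:t], targets[t]))
    return samples


# Illustrative use with a toy power series (values are made up).
power = min_max_scale([0.0, 1.2, 3.4, 5.0, 4.1, 2.0, 0.3, 0.0])
rows = [[p, *encode_hour(h), *encode_month(6)] for h, p in enumerate(power, start=8)]
samples = make_windows(rows, power)
```

Each sample holds five consecutive feature rows and the power value of the sixth hour, which matches the one-hour-ahead setup described in the text.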
Ensuring an accurate base model is a prerequisite for achieving accurate predictions in the target domain. In this context, the performance of the LSTM model for the base PV is evaluated with the following procedure. The base PV dataset is split into a train set and an evaluation set using an 80-20 split, keeping the first 80% for training and the remaining 20% for testing (17563 observations for the training process and 4391 observations for evaluation purposes), and the LSTM model is trained on the training set. The accuracy of the model is evaluated by computing the root mean squared error (RMSE) and the mean absolute error (MAE) of the respective forecasts across the evaluation period considered, as well as the coefficient of determination $R^2$ between the forecasts and the real values. In these metrics, $y_t$ is the real value of the solar production time series at hourly interval $t$ of the evaluation period, $\hat{y}_t$ is the produced forecast of the model and $\bar{y}$ is the average of the real values. Apart from these error metrics, two additional metrics are calculated in order to make the model evaluation more complete: the Mean Bias Error (MBE) and the normalized root mean squared error (NRMSE). The MBE represents the systematic tendency of a forecasting model to under- or over-forecast, while the NRMSE is suitable for comparing models of different scales, as it relates the RMSE value to the observed range of the variable.

The model achieves high accuracy, managing to efficiently capture the daily patterns of the most important variables, as reflected by the utilized metrics (MAE = 0.467, RMSE = 0.992, MBE = −0.097, NRMSE = 0.301, $R^2$ = 96.254%). However, even these five error metrics are not enough to sufficiently illustrate the capabilities of the proposed model in comparison with other models in different geographical locations. According to Yang et al. 57,58, the accuracy of solar forecasting models (in general, the term "solar forecasting" may refer to either solar irradiance forecasting or solar power forecasting; throughout this study the term refers to solar power forecasting) must be inter-comparable across different locations and different time periods through a common metric, the forecast skill index. The forecast skill index is based on the comparison of the proposed model to a reference model on a specific error metric. However, two issues arise: which reference model and which error metric must be used? The most common reference model to standardize the verification of solar forecasting models is the persistence model. More specifically, the use of a smart persistence model as the reference model is highly recommended, rather than the naive (or simple) persistence model 59. Regarding the optimal error metric, the RMSE is the most suitable metric in the case of solar power production, as it is appropriate for capturing large errors 57. Thus, the forecast skill index is computed from two quantities: $RMSE_{proposed}$, the RMSE value of the developed LSTM model, and $RMSE_{reference}$, the RMSE value of the smart persistence model.
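The evaluation metrics named in this section can be computed as in the sketch below, which uses their standard textbook definitions. Three points are assumptions rather than statements from the text: the MBE sign convention (forecast minus actual), the NRMSE denominator (observed range of the actual values, following the verbal description above), and the forecast skill written as one minus the ratio of the two RMSE values.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)


def mbe(actual, forecast):
    """Mean bias error; assumed sign convention: positive means over-forecasting."""
    return sum(f - a for a, f in zip(actual, forecast)) / len(actual)


def nrmse(actual, forecast):
    """RMSE divided by the observed range of the actual values (assumed denominator)."""
    return rmse(actual, forecast) / (max(actual) - min(actual))


def r_squared(actual, forecast):
    """Coefficient of determination between forecasts and real values."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def forecast_skill(rmse_proposed, rmse_reference):
    """Forecast skill index: positive when the proposed model beats the reference."""
    return 1.0 - rmse_proposed / rmse_reference


# Toy example (values are made up for illustration).
actual = [1.0, 2.0, 3.0, 4.0]
forecast = [1.0, 2.0, 3.0, 6.0]
scores = {
    "RMSE": rmse(actual, forecast),
    "MAE": mae(actual, forecast),
    "MBE": mbe(actual, forecast),
    "NRMSE": nrmse(actual, forecast),
    "R2": r_squared(actual, forecast),
}
```

With this form of the skill index, a model whose RMSE is half that of the reference obtains a skill of 0.5, and a model that is worse than the reference obtains a negative value.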
The last question that arises concerns the selection of the smart persistence model in the case of solar power forecasting. For solar irradiance forecasting problems, the smart persistence model derives from integrating clear-sky conditions into the reference model 59. The same also applies to PV power forecasts, where several smart persistence models have been proposed 60. More specifically, a clear-sky index has been proposed by Engerer and Mills for the case in which the characteristics of the PV panel are known 61, while another PV smart persistence model based on scaling global horizontal irradiance to the PV production value has been presented by Huertas and Centeno 62. In this study, the definition of Pedro and Coimbra is adopted, which is based on estimating the expected power output under clear-sky conditions 63. In the adopted smart persistence model, $y(t)$ is the measured power output and $y_{cs}(t)$ represents the expected power output under clear-sky conditions. The purpose of this model is to decompose power output, assuming that the fraction of the power output relative to the clear-sky conditions remains the same between short time intervals. Moreover, at night the forecast of the smart persistence model is considered equal to the clear-sky power output. The approximated function for the clear-sky model can be created by averaging past power output values depending on the hour of the day (between 0 and 23) and the day of the year (between 0 and 255). The second step involves ...

Although the smart persistence model shows quite good performance in comparison with the naive persistence model ($RMSE_{Naive}$ = 1.985, $MAE_{Naive}$ = 1.110), it is evident that the LSTM significantly outperforms the smart persistence model. This is also highlighted through the forecast skill index of the LSTM model, which is equal to 0.221. A positive forecast skill index indicates that the proposed model outperforms the smart persistence model, while a negative one shows that the smart persistence model performs better.
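The smart persistence idea described above, namely keeping the ratio between measured and clear-sky power constant over one step and falling back to the clear-sky value at night, can be sketched as follows. The averaging used for the clear-sky profile is a simplified assumption (a plain mean per day-of-year and hour key), and the function names and the `eps` threshold are hypothetical; the exact procedure of the study is not recoverable from the text.

```python
from collections import defaultdict


def clear_sky_profile(history):
    """Average past power output per (day of year, hour of day) key.
    history: iterable of (day_of_year, hour, power) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, hour, power in history:
        sums[(day, hour)] += power
        counts[(day, hour)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def smart_persistence(y_now, cs_now, cs_next, eps=1e-6):
    """One-step-ahead forecast that keeps the ratio y / y_cs constant.
    When the current clear-sky output is (near) zero, e.g. at night,
    the forecast equals the clear-sky output of the next step."""
    if cs_now <= eps:
        return cs_next
    return (y_now / cs_now) * cs_next


# Toy example (values are made up for illustration).
profile = clear_sky_profile([(10, 12, 4.0), (10, 12, 6.0), (10, 13, 8.0)])
next_hour = smart_persistence(2.5, profile[(10, 12)], profile[(10, 13)])
```

In the toy example the plant currently produces half of its clear-sky estimate, so the forecast for the next hour is half of the next clear-sky value.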
Finally, Fig. 5 depicts the results of the forecasting models (LSTM baseline model and smart persistence model) for two different periods. It can be concluded that the model manages to capture seasonality, trends and weather-related variations both in summer and winter periods, and thus offers significantly better forecasts compared to the smart persistence model.

Transfer learning methods. The TL models are equipped with exactly the same characteristics as the baseline model, using the baseline pre-trained model to solve exactly the same problem, with the same features and the same expected output, in a different PV plant. Therefore, the features of the TL models are: (a) measured power output, (b) air temperature, (c) global horizontal irradiance, (d) humidity, (e) month of the year (one-hot encoding) and (f) hour of the day (sine/cosine transformation), and the model output is a one-hour ahead forecast of the PV power output.
The validation process of the proposed TL strategies is implemented in 6 PV plants, with different nominal and peak capacities, located in 4 cities in Portugal. Four models are compared: the three presented TL strategies and a conventional model to which no TL has been applied. For the TL models, pre-training is applied on the whole dataset of the base PV (30 months of data). Then, the four models are trained using one year of data (8760 h) and they are tested on the rest of the dataset. For each PV plant the size of the test dataset is different depending on data availability, as presented in Table 1. The models' accuracy is evaluated based on their performance on the evaluation data using RMSE, MBE, MAE, NRMSE and $R^2$. 20 training repetitions are performed for each model, in order to reduce the effect of randomness. This number of repetitions is generally proposed in the literature. The forecasting performance for all models is presented in Table 2, where the average values of RMSE, MBE, MAE, NRMSE and $R^2$ are reported, providing some very useful insights.
Firstly, it is worth mentioning that almost all LSTM models perform better than the smart persistence model in terms of RMSE. This fact illustrates the suitability of the selected model and the selected features for this problem. The only case in which the LSTM performs worse than the smart persistence model is the conventional LSTM of PV 2. Even in this case the three TL models have lower error indexes than the smart persistence one. The forecast skill index varies between −0.15 (it is negative in the case of PV 2) and 0.48 for the conventional model, while it varies between 0.28 and 0.56 for the TL strategies. The average percentage increase of the forecast skill index between the conventional and the TL models is 16.3%. Finally, the MBE index shows that none of the developed models exhibits a systematic bias.
Regarding the comparison between the conventional LSTM and the three TL models, the impact of TL is evident, as the TL strategies have better accuracy than the conventional one for all six PVs. The boxplots presented in Fig. 6 also show that the conventional LSTM has a greater average RMSE value in all target PV plants, while it also demonstrates a bigger variance compared to the TL models. Indeed, the models that are used without TL suffer from high variance, offering considerably different accuracy in each repetition. On the contrary, models trained with the three TL strategies show nearly zero variance, while also achieving more accurate, non-biased forecasts. A remarkable point is that for PV 3, where the evaluation period is only 38 days (910 hourly point forecasts), the three TL models do not seem to outperform the conventional model to the extent that they do for the other PV plants. This is due to the fact that the evaluation takes place solely in March (winter period), when the problem is more complex as weather patterns are often disturbed. Another sign of the forecasting difficulty in PV 3 is that the smart persistence model is not able to make better forecasts either.

Data availability impact. As mentioned in the introductory section, one calendar year of data is the minimum time interval for a model to be sufficiently trained, in order to incorporate all seasonal and weather patterns of the problem. Also, the presented results indicate that TL models can perform better than conventional models in a scenario where one year of training data is available, while conventional models are still better than the reference smart persistence ones. However, the application of TL offers the possibility to obtain reliable and accurate predictive models even when the available training data for the target domain cover less than one year.
In this context, the proposed architectures are compared on the target PVs for different training periods, namely 3 months, 6 months and 9 months of available data. It must be noted that, although the training period has changed, the testing period has been kept the same for comparison purposes between the different scenarios. Figure 7 presents the RMSE index of the four models in the four scenarios of different training periods, including the one-year scenario. Results indicate that the TL models are more robust to different volumes of training data and that their performance slightly improves when more data are available. This can be attributed to their prior training on 30 months of hourly data from the base PV. On the other hand, the impact of data scarcity is apparent for the conventional LSTM model, which radically improves when the training period increases and it identifies new seasonal and weather patterns. It is worth mentioning that none of the 3-month trained conventional models outperforms the smart persistence model, while only three 6-month trained conventional models manage to achieve better accuracy compared to the smart persistence one. The same does not apply to the 3-month trained TL models, which have lower RMSE compared to both the conventional LSTM and the smart persistence model. Finally, the difference in terms of RMSE between the conventional model and the best-performing TL model decreases as more training data become available. This is evident in all six target PVs. For example, the difference in terms of RMSE in PV 5 drops from 150.5% (3-month training models) to 15.1%, about 10 times lower. The same decreasing pattern is also identified in the other five PVs, further highlighting the importance of TL, especially when less than one calendar year of data is available.

Discussion
Collecting sufficient data from recently installed solar plants is a long process. The urgent need for accurate PV production forecasts has led to the idea of exploiting solar plants with sufficient data to provide accurate forecasts for recently installed ones. The purpose of this research is to determine whether TL can be efficiently employed to provide PV production forecasts for solar plants with limited data. Three TL strategies based on a stacked LSTM architecture are developed and compared to a non-TL approach. The presented methodology is tested on power plants in different cities and with different nominal capacities. The findings of the experimental application indicate that all three TL strategies significantly outperform the non-TL approach in terms of forecasting accuracy, evaluated by several error indexes.
Moreover, the models are compared with a smart persistence model based on the clear-sky power output. The models which are trained with the three TL strategies significantly outperform the reference model, having a forecast skill score between 0.28 and 0.56, which is considered satisfactory by the existing literature. On the opposite side, ... year of data is available. Results of additional experiments using varying volumes of training data suggest that the less data available, the greater the gap between TL strategies and the non-TL approach, further necessitating the use of TL. Especially in the scenario in which 3 months of data are available for training, the gap between the conventional model and the TL ones significantly increases, while the conventional LSTM fails to outperform even the reference model.

Figure 6. Boxplot that summarizes the performance of the four stacked LSTM models for the six target PV plants based on the RMSE. Base stands for the model to which no TL has been applied, while TL1, TL2, TL3 stand for the three TL strategies, respectively.

Last but not least, one of the most significant parameters in the concept of TL is the replicability of the presented TL strategies. In order to assess this aspect, the experimental application has been performed on three PV plants located in the same city as the base PV and on three PV plants located in different cities. This enables comparison between the two groups in terms of forecasting accuracy. However, the fact that these PV plants are located in different cities and have different nominal power indicates that the comparison must take place on the forecast skill index, to be in alignment with the proposed verification guidelines for deterministic solar forecasts. The average forecast skill index for PV plants in the same city is 0.4, while the corresponding value for PV plants in different cities is 0.43. This suggests that the forecasting accuracy of the models is not affected by the geographical distance between the base and the target PV.
This study is the first step towards enhancing our understanding of the impact of TL on solar plant power prediction. Future work will concentrate on assessing the impact of the base model's training data volume, investigating whether training base models with more data or with data from different solar plants could further improve forecasting accuracy. This could result in the evolution of cross-stakeholder models and data sharing among energy communities, with the aim of promoting inclusiveness in smart city environments. Further studies, which take differences in geographical characteristics between the base and the target domain into account (e.g., altitude, solar plant orientation), will also need to be performed. Finally, the prospect of being able to use TL for solar plant power forecasting serves as a continuous incentive for future research on transferring knowledge in similar problems, such as cross-building energy forecasting, wind power prediction, and hydraulic plant generation forecasting, among others.

Data availability
The meteorological data included in this study are available within the Copernicus Atmosphere Data Store via the CAMS Solar Radiation time series database (https://ads.atmosphere.copernicus.eu) and the Weather Underground website (https://www.wunderground.com). User registration (free) is required for downloading. All power output data analyzed during this study can also be provided upon request.

Code availability
The deep learning-based pre-trained models for PV power output forecasting and the source code required for reproducing the results of this study are available at https://github.com/ElissaiosSarmas/Transfer-learning-strategies-for-solar-power-forecasting-under-data-scarcity.